French Multi Word Expressions: Using Data on Different Patterns for Extraction and Validation

Author

  • Marie Dubremetz
Abstract

Multi Word Expressions (MWE) are an important problem in NLP. Many researchers use association measures for collecting and evaluating MWE candidates. In this paper we test whether it is legitimate to use those measures, when the data are collected on only one pattern of MWE (e.g. Noun-Adjective), to evaluate candidates belonging to another pattern (e.g. Noun-Noun). For this purpose, we run tests on the French Europarl corpus. Using association measures extracted from Noun-Adjective patterns as features, we train a model that we evaluate on instances of Noun-Noun candidates. We notice that with this method the model still correctly evaluates a quarter of the candidates. However, the results tend to be lower.
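As an illustration of the kind of association measures the abstract refers to, the following minimal sketch computes three common ones (pointwise mutual information, the Dice coefficient and a t-score) from raw counts. The abstract does not say which measures, tools or counts were used, so the choice of measures, the function name `association_measures` and the toy counts are assumptions made for this example only.

```python
import math


def association_measures(c_xy, c_x, c_y, n):
    """Return PMI, Dice and t-score for one candidate word pair.

    c_xy: number of times the pair occurs together
    c_x, c_y: number of occurrences of each word on its own
    n: total number of pairs observed in the corpus
    (Hypothetical helper, not the paper's implementation.)
    """
    if c_xy <= 0 or c_x <= 0 or c_y <= 0 or n <= 0:
        raise ValueError("all counts must be positive")
    # Co-occurrence count expected if the two words were independent.
    expected = c_x * c_y / n
    pmi = math.log2(c_xy / expected)
    dice = 2 * c_xy / (c_x + c_y)
    t_score = (c_xy - expected) / math.sqrt(c_xy)
    return {"pmi": pmi, "dice": dice, "t_score": t_score}


# Toy counts, invented for illustration (not taken from Europarl).
noun_adjective = association_measures(c_xy=10, c_x=20, c_y=20, n=1000)
noun_noun = association_measures(c_xy=4, c_x=20, c_y=20, n=100)
print(noun_adjective)
print(noun_noun)
```

In a cross-pattern setting like the one described, such values would be computed for Noun-Adjective candidates to build the training features and for Noun-Noun candidates to build the evaluation set; the classifier itself is left out of this sketch.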


Similar resources

Project proposal Automatic extraction and evaluation of MWE: adapting method to French Language Technology: Research and Development

Our project is based on the theme of Multi Word Expressions (MWE); we will focus on the problem of extraction. This task is important for improving lexical resources used for tasks such as tokenization, parsing or translation. In our study we will work on a French corpus. Our aim will be not only to select candidates but also to validate automatically which ones are true MWEs. If we have time we wil...


Extraction of Nominal Multiword Expressions in French

Multiword expressions (MWEs) can be extracted automatically from large corpora using association measures, and tools like mwetoolkit allow researchers to generate training data for MWE extraction given a tagged corpus and a lexicon. We use mwetoolkit on a sample of the French Europarl corpus together with the French lexicon Dela, and use Weka to train classifiers for MWE extraction on the gener...


Corpus-Driven Study of Multi-Word Expressions Based on Collocations from a Very Large Corpus

We present a corpus-driven approach to the study of multi-word expressions, which constitute a significant part of. As a data basis, we use collocation profiles computed from DeReKo (Deutsches Referenzkorpus), the largest available collection of written German which has approximately two billion word tokens and is located at the Institute for the German Language (IDS). We employ a strongly usag...


Feature selection using genetic algorithm for classification of schizophrenia using fMRI data

In this paper we propose a new method for classification of subjects into schizophrenia and control groups using functional magnetic resonance imaging (fMRI) data. In the preprocessing step, the number of fMRI time points is reduced using principal component analysis (PCA). Then, independent component analysis (ICA) is used for further data analysis. It estimates independent components (ICs) of...


Towards a mixed approach to extract biomedical terms from documents

The proposed work aims at automatically extracting biomedical terms from free text. We present new extraction methods taking into account linguistic patterns specialized for the biomedical field, statistic term extraction measures such as C-value and statistic keyword extraction measures such as Okapi BM25, and TFIDF. These measures are combined in order to improve the extraction process and we...




Publication date: 2013